
    Compilation and Exploitation of Parallel Corpora

    With more and more text being available in electronic form, it is becoming relatively easy to obtain digital texts together with their translations. The paper presents the processing steps necessary to compile such texts into parallel corpora, an extremely useful language resource. Parallel corpora can be used as a translation aid for second-language learners, for translators and lexicographers, or as a data source for various language technology tools. We present our work in this direction, which is characterised by the use of open standards for text annotation, the use of publicly available third-party tools, and wide availability of the produced resources. We explain the corpus annotation chain, which involves normalisation, tokenisation, segmentation, alignment, word-class syntactic tagging, and lemmatisation. Two exploitation results over our annotated corpora are also presented, namely a Web concordancer and the extraction of bilingual lexica.
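
The first steps of such an annotation chain can be illustrated with a minimal sketch. It is not the authors' toolchain: the function names, the regular expressions, and the strictly positional 1:1 sentence alignment are all simplifying assumptions, and tagging and lemmatisation are left out because they need trained models. The two sample texts are invented.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def segment(text: str) -> list[str]:
    """Split into sentences after '.', '!' or '?' plus whitespace (naive)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenise(sentence: str) -> list[str]:
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def align(source: list[str], target: list[str]) -> list[tuple[str, str]]:
    """Pair sentences by position; assumes a strict 1:1 translation."""
    if len(source) != len(target):
        raise ValueError("positional alignment needs equal sentence counts")
    return list(zip(source, target))


# Invented sample texts, used only to exercise the functions above.
english = "The  corpus is small. It is aligned."
slovene = "Korpus je majhen. Je poravnan."

pairs = align(segment(normalise(english)), segment(normalise(slovene)))
for src, tgt in pairs:
    print(tokenise(src), "<->", tokenise(tgt))
```

Real parallel texts rarely translate sentence for sentence, so a practical pipeline would replace `align` with a dedicated sentence aligner that can handle 1:2 and 2:1 correspondences.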

    A corpus-based study of 16th-century Slovene clitics and clitic-like elements

    This paper undertakes a corpus-based linguistic investigation of the spelling variation in 16th-century Slovene from both the diachronic and synchronic points of view. The investigation is based on a manually annotated sample (approx. 14,000 word tokens) from Primož Trubar’s Ta pervi deil tiga Noviga teſtamenta, 1557, and Hiſhna poſtilla, 1595, and Jurij Juričič’s Poſtilla, 1578, and it concentrates on clitics and clitic-like elements. The statistical analysis compares the spelling conventions of the early modern period to those of contemporary Slovene using normalised forms of the originals, observing cases where one orthographic word is nowadays written as two or more words (1–n mapping) or vice versa (n–1 mapping). It shows that the overall percentage of split and joined word tokens is 5.7%, with JPo 1578 having the highest percentage and TPo 1595 the lowest, less than half of that of JPo 1578. Of these, the vast majority are cases where a word is now split. Most prominent among the bound words are the non-syllable prepositions v ‘in(to)’, k ‘to’, and z ‘with’, followed by the negative proclitic ne ‘not’, the enclitic particle li ‘whether, if’ and, in rare instances, the conditional particle bi, the reflexive particle se, na ‘on’, ob ‘at, by’, pri ‘at, beside’ and za ‘for, behind’ (the absolute numbers of specific clitics partially correlate with the prevalence of bound variants in comparison with the freestanding variants of those clitics, with the most frequent being predominantly bound while the least frequent are predominantly freestanding). Individual instances of two accented words written together can be attributed to German influence (figino_drevo, der Pfeigenbaum ‘fig tree’). 
The cases where one modernised word corresponds to two original words are, with the exception of the superlative adjective/adverb prefix naj-/nar- ‘the most’, which is orthographically bound with its root in about 25% of instances, sporadic or can be identified as errors in the original books. Of interest are also cases where beginnings of words that are homonymous with non-syllable or one-syllable prepositions are separated from the remainder of the word with an apostrophe (e.g. s’_nameinja ‘signs’, s’_derſhati ‘to endure’, do_bruta ‘goodness’, sa_doſti ‘enough’). The normalisation also enables the identification of the orthographical variants of the most commonly bound clitics, i.e. the non-syllable prepositions k, z and v. K and its allomorph /h/ have 5 attested spelling variants, of which one is limited to hosts starting with a v-. For z, with a voiced allomorph /z/ and a voiceless allomorph /s/, three variant spellings were discovered that only partially correspond with a voiceless/voiced distinction of the initial sound of the host word, and cases of merging with a host that begins with s-/z- were identified. Additional positional spellings probably represent other allomorphs: for palatalized /ž/ in front of a palatal ń and , >ſo/so> for syllabified /za/, /zo/. The preposition v shows the highest degree of orthographical variation of all analysed words as it has 10 different spellings: general bound and and freestanding ; , and in front of a vowel; and attested only in front of a v-, as well as and merged with the initial v- of the host. The analysis of spelling variation in non-syllable prepositions showed that even a relatively limited hand-corrected annotated sample enabled identification of the majority of spelling variants identified in previous works, while the use of the noSketch Engine tool yielded further information about their relative frequency and distribution. 
As the hand-corrected corpus is expanded, such research will yield even more relevant information for the study of the 16th-century Slovene literary language, significantly supplementing existing findings (based on traditionally collected examples) with a large amount of statistically relevant data.
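
The 1–n and n–1 mappings described in this abstract can be made concrete with a small sketch. It assumes each annotated unit is an (original, normalised) pair in which words are separated by spaces; the function names and the four sample pairs are invented for illustration and are not taken from the annotated corpus. The sketch counts alignment units, which may differ from the token-based counting behind the reported 5.7%.

```python
def mapping_type(original: str, normalised: str) -> str:
    """Classify one alignment unit by its original and modern word counts."""
    n_orig = len(original.split())
    n_norm = len(normalised.split())
    if n_orig == 1 and n_norm > 1:
        return "1-n"  # one historical word, nowadays written as several
    if n_orig > 1 and n_norm == 1:
        return "n-1"  # several historical words, nowadays written as one
    return "1-1"


def split_joined_share(pairs: list[tuple[str, str]]) -> float:
    """Percentage of units whose word count changed under normalisation."""
    changed = sum(1 for o, n in pairs if mapping_type(o, n) != "1-1")
    return 100.0 * changed / len(pairs)


# Invented examples (hypothetical forms, not attested corpus data).
sample = [
    ("kbogu", "k bogu"),         # bound preposition, now split
    ("nar vekshi", "največji"),  # superlative prefix, now joined
    ("beseda", "beseda"),
    ("inu", "in"),
]

print({pair: mapping_type(*pair) for pair in sample})
print(split_joined_share(sample))
```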

    Korpusi in konkordančniki na strežniku nl.ijs.si (Corpora and Concordancers on the nl.ijs.si Server)

    In this paper we present the reference, specialised and parallel corpora that can be accessed through the concordancers on the nl.ijs.si server. Most of the corpora contain Slovene texts, while a few are in other languages. Many of the corpora have existed for some time but have now been newly annotated; new texts have been added to some, and some are entirely new. The texts in the corpora are equipped with metadata, and the word tokens are manually or automatically assigned at least lemmas and morphosyntactic tags. In most cases the corpora are freely accessible through two web concordancers, which enable searching over large annotated corpora and offer a rich set of analytical tools, options for filtering by metadata, and saving of results to the user's own computer. In addition to the corpora and the two concordancers, we discuss some of the issues that arose in providing this kind of infrastructure for corpus linguistics, and conclude with guidelines for further work.

    Dealing with Abbreviations in the Slovenian Biographical Lexicon

    Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make the text less readable, especially in printed reference books, where they are used extensively. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text. We apply this method to the case of a Slovenian biographical lexicon and evaluate it on a newly developed gold-standard dataset of 51 Slovenian biographies. Our abbreviation identification method performs significantly better than commonly used ad-hoc solutions, especially at identifying unseen abbreviations. We also propose and present the results of a method for expanding the identified abbreviations in context. Comment: To be presented at The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022).
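
For orientation, the kind of ad-hoc solution the abstract compares against might look like the sketch below. This is an assumed baseline, not the method proposed in the paper: the seed list `KNOWN`, the four-letter length limit, and the "followed by a lowercase letter or digit" rule are all hypothetical choices, and the sample sentence is invented.

```python
import re

# Hypothetical seed list of known abbreviations (lowercased).
KNOWN = frozenset({"prof.", "dr.", "univ."})

# Candidate: a run of one to four letters immediately followed by a period.
CANDIDATE = re.compile(r"\b[^\W\d_]{1,4}\.")


def find_abbreviations(text: str, known: frozenset = KNOWN) -> list[str]:
    """Return candidates that are in the seed list or occur mid-sentence."""
    found = []
    for match in CANDIDATE.finditer(text):
        token = match.group()
        rest = text[match.end():].lstrip()
        # A following lowercase letter or digit suggests the period
        # does not end the sentence.
        mid_sentence = bool(rest) and (rest[0].islower() or rest[0].isdigit())
        if token.lower() in known or mid_sentence:
            found.append(token)
    return found


# Invented sentence in the style of a biographical entry.
sample = "Roj. 1850 v Ljubljani, umrl kot univ. prof. na Dunaju."
print(find_abbreviations(sample))
```

A heuristic of this kind misses unseen abbreviations that are followed by a capitalised word, which is the situation where the abstract reports the largest gains for the proposed method.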

    The Parla-CLARIN Recommendations for Encoding Corpora of Parliamentary Proceedings

    Parliamentary proceedings are a rich source of data that can be used by scholars in various humanities and social sciences disciplines. Unlike the sources of most other language corpora, parliamentary proceedings are not subject to copyright or personal privacy protections, and are typically available online, thus making them ideal for compilation into corpora and for open distribution. For these reasons many countries have already produced corpora of parliamentary proceedings, but each typically in its own encoding, limiting their comparability and utilization in a multilingual setting. In this paper we propose an encoding schema which could serve as an interchange format for parliamentary corpora compiled for the purposes of scholarly investigations. The schema, called Parla-CLARIN, was developed within the CLARIN research infrastructure, and is written as a TEI ODD which includes a TEI customization and prose guidelines with examples of use. We discuss the coverage and choices made in designing the recommendations, and give an overview of the guidelines. We also discuss two other standard schemas for encoding parliamentary data, Akoma Ntoso and RDF, and their relation to Parla-CLARIN. We conclude by presenting corpora already encoded in Parla-CLARIN and discussing further work, especially the provision of a set of example documents and of transformation scripts that would make the proposed encoding more usable.
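
To give a feel for what a TEI-based encoding of a speech might look like, the sketch below builds one utterance with Python's standard library. It assumes the TEI `u` (utterance) element with a `who` pointer and `seg` children, which is how TEI encodes transcribed speech in general; whether a given fragment is valid Parla-CLARIN must be checked against the schema itself. The speaker identifier and the text are invented.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; registering it as the default keeps output prefix-free.
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def utterance(speaker_id: str, segments: list[str]) -> ET.Element:
    """Build a TEI <u> element pointing at a speaker, one <seg> per segment."""
    u = ET.Element(f"{{{TEI_NS}}}u", {"who": f"#{speaker_id}"})
    for text in segments:
        seg = ET.SubElement(u, f"{{{TEI_NS}}}seg")
        seg.text = text
    return u


# Hypothetical speaker ID and speech content.
speech = utterance("SpeakerA", ["Thank you, Mr President.", "I will be brief."])
serialised = ET.tostring(speech, encoding="unicode")
print(serialised)
```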